Rejoinder: The Madness to Our Method: Some Thoughts on Divergent Thinking
Abstract
In this reply, the authors examine the madness to their method in light of the comments. Overall, the authors agree broadly with the comments; many of the issues will be settled only by future research. The authors disagree, though, that past research has proven past scoring methods—including the Torrance methods—to be satisfactory or satisfying. The authors conclude by offering their own criticisms of their method, of divergent thinking, and of the concept of domain-general creative abilities.

It is unoriginal to start a reply by thanking the scholars who wrote comments, but sometimes originality is overrated: We appreciate the care that went into these detailed and insightful comments. But we hope that this reply does not degenerate into a stereotype, in which we defend each of our claims to the death—indeed, many of our claims are probably already dead. Instead, we would like to use the comments as a starting point for exploring some new ideas and complex issues that could not be squeezed into our overly long target article (Silvia et al., 2008). The first sections of this reply consider issues raised and inspired by each comment. The last section considers our own criticisms of our scoring method, of divergent thinking, and of the notion of domain-general creative abilities. By the end, we hope that readers of this exchange have a stronger sense of unresolved issues and future directions for divergent thinking research.

What About the Torrance Tests?

Kyung Hee Kim (2008) provided a nice overview of the Torrance Tests, which differ from other classes of divergent-thinking tests in some important ways. This comment offers a chance to consider some issues, strengths, and weaknesses of the Torrance Tests. Before doing so, though, we tackle a few minor points of clarification. Kim broke our objections down into four points; we modestly take issue with a couple of them.

First, as an aside, we didn't (and wouldn't) argue that divergent-thinking tests are too highly correlated with general intelligence. Perhaps Simonton's (2003) quotation in our introduction—used to illustrate a famous researcher's skeptical position—gave this impression. Regardless, we agree with Kim (2008), and disagree with Simonton, on this point; Kim's (2005) fine meta-analysis has settled this classic issue. Second, we do believe that divergent-thinking tests have changed little since the 1960s. This, however, is merely our opinion, not an evidence-based argument. (Some things do not change because they are already good: The Likert scale, for example, has not changed much since Rensis Likert and Gardner Murphy developed it in the 1930s.) Certainly, some creativity tests have been renormed, and some scoring systems have been created, refined, or jettisoned. But the core psychometric structure of these tests seems essentially the same: Researchers ask for responses and then quantify performance on the basis of some form of uniqueness. This is ultimately an opinion—we will leave it to readers to develop their own opinions. Third, we're aware that the Torrance Tests instruct people to be creative. We see this as a strength of Torrance's approach. And fourth, we certainly believe that fluency scores and uniqueness scores are highly correlated—so high, in fact, that they are confounded. We thus would not agree with Kim's (2008) interpretation of our first study:
"Silvia et al. could not avoid the high positive correlations between fluency scores and uniqueness scores in their study although they intended to solve their perceived problem." Our aim was not to create a better, refined uniqueness score; instead, the results of our Study 1 provide one more demonstration of the confounding of fluency and uniqueness. Our attempt to solve this problem (among others) was the testing of subjective scoring methods.

Do the Torrance Tests solve the problems that other divergent-thinking tests face? The Torrance approach solves a few, we think. Instructing participants to be creative is essential to valid scores, in our view. The Torrance approach also uses a different sense of uniqueness: People receive points for "not unoriginal" responses, so the threshold for originality does not shift with sample size. Unlike the traditional Wallach and Kogan (1965) scoring system or the threshold scoring systems described by Runco (2008), the Torrance approach avoids the "large sample penalty" that we described in our target article. But there is still a huge problem, one of the central problems that motivated our research: Originality scores correlate too highly with fluency scores, particularly in the Torrance Verbal tests. According to the latest Torrance (2008) norms, the median correlation between originality and fluency is r = .88. Kim (2008) skirted this issue, which is generally skirted in discussions of the Torrance Tests. We understand that Torrance and others have argued that creative people can generate a lot of responses, but this feels like an admission of psychometric defeat: The unacceptably high overlap between originality and fluency forces this conclusion. Originality and fluency simply are not distinctly assessed by the Torrance approach—the scores are interchangeable.

We find the correlation between fluency and originality/uniqueness to be too high. At this level of covariance, one score is simply a coarsened measure of the other: Interpreting them as unique, distinct constructs strikes us as folly. Campbell and Fiske's (1959) multitrait–multimethod (MTMM) approach to construct validation supports our assertion: Discriminant validity is necessary for meaningful distinctions between constructs. But we leave it to the gentle reader to draw his or her own conclusion. Does any reader find this correlation acceptable? Does anyone see this correlation and conclude, "What a relief! For a moment, I was afraid that fluency and originality measured the same thing"? In any other area of research, would a correlation around r = .90 be evidence for discriminant validity? The Torrance Tests solve some of the problems that other divergent-thinking tests face, but they do not solve this serious, fundamental problem of measurement.

What about Torrance's figural tests? We did not have the space to discuss them in our target article, but there are some interesting issues surrounding them. Most generally, can the figural tests compensate for some of the problems with the verbal tests? As Kim (2008) pointed out, some of the figural tests control for the number of responses, thus separating quality from amount. And the figural tests can be scored for additional dimensions of performance, thereby enriching the verbal tests. But here's the rub: The Torrance verbal tests and figural tests correlate modestly, at best—they are not indicators of the same higher-order construct (i.e., creativity, creative potential, or divergent thinking).
In the Torrance data analyzed by Plucker (1999), a latent Figural factor correlated only r = .36 with a latent Verbal factor. Clapham (2004) found an identical correlation (r = .36), eerily enough, in a study of observed rather than latent Verbal and Figural scores. The modest correlation between the verbal and figural tests is not necessarily a problem: In multifaceted intellectual assessments, researchers expect (and desire) some components to be weakly related (e.g., vocabulary knowledge and spatial reasoning). But researchers do not then treat the components as interchangeable, nor do they use the strengths of one to compensate for the other. When the verbal tests are criticized, the figural tests should not be raised in response, and vice versa. If the tests do not substantially measure the same underlying construct, then they are not interchangeable. In MTMM terms, the verbal and figural tasks are "different trait, different method," not "same trait, different method." Because they vary in trait (we suspect) and in method, it is hard to compare them.

Wrapping up, we disagree that the Torrance Tests solve the pressing problems faced by researchers interested in assessing divergent thinking. Nevertheless, we agree that new scoring methods should be compared with the Torrance scoring methods. The validity of a scoring method is ultimately an empirical issue. Perhaps a Torrance-certified rater is interested in reanalyzing the responses?

What About Objective Scoring Methods?

Mark Runco (2008), too, offered much food for thought, based on his decades of work in divergent thinking. Although a reader of our target article and of his comment may not believe it, we agree with Runco about most of his points and his global approach. We differ, though, in how we think of construct validity and in our interpretation of past research.

Runco pointed to threshold scoring methods, such as using a threshold of 5% or 10% for uncommon responses. Threshold scoring methods go back at least half a century, and they have some virtues—but how well do they work? Many scoring systems have some desirable features; Runco himself, over the years, has developed and tested approaches that overcome some of the limitations of the systems favored by Torrance and by Wallach and Kogan (1965). But how well do they work? In our view, there is not a lot of work that suggests strong evidence for validity. We do not mean that there is no evidence; to the contrary, there are many studies that show small-to-medium positive effects that are independent of fluency. If we want tasks that yield small, positive effects (based on small-sample studies), then we have found our tasks. If we want tasks that yield scores of this nature, then these tasks are valid for the purpose of obtaining small, consistent effects. (We would be curious to see what a meta-analysis has to say; our interpretation of the literature is a subjective rating, although we think that Baer [2008] would agree with us.) Either divergent thinking has only small and medium effects, or these assessment methods need refinement. We suspect that divergent thinking's true effects are larger but obscured by measurement error.

We also doubt that 5% and 10% threshold scoring systems solve the sample-size problem faced by uniqueness scoring. For a given participant, the chance of having a unique response (i.e., a response that passes the 5% or 10% cutoff) increases as his or her number of responses increases.
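To see the worry concretely, here is a minimal simulation sketch. Everything in it is hypothetical: the idea pool, the frequencies, the fluency levels, and the helper function threshold_score are illustrative assumptions, not values or code from our studies or from any published norms.

```python
# A minimal, hypothetical sketch: every participant samples ideas from the same
# pool, in which 10 ideas are common (population frequency 8%) and 90 are rare
# (about 0.2% each). Under threshold scoring, a response earns credit when its
# population frequency falls below the cutoff (here, 5%).
import numpy as np

rng = np.random.default_rng(0)
pool = np.r_[np.full(10, 0.08), np.full(90, 0.2 / 90)]  # frequencies sum to 1.0

def threshold_score(fluency, cutoff=0.05):
    """Count how many of a participant's responses fall below the frequency cutoff."""
    ideas = rng.choice(pool.size, size=fluency, replace=False, p=pool)
    return int(np.sum(pool[ideas] < cutoff))

for fluency in (3, 6, 12, 24):
    mean_score = np.mean([threshold_score(fluency) for _ in range(2000)])
    print(f"fluency = {fluency:2d}   mean threshold score = {mean_score:.2f}")
# The mean rises steadily with fluency even though everyone draws from the same
# pool: threshold-based originality remains confounded with fluency.
```

The mean threshold score climbs with fluency even though every simulated participant draws from the same pool, and the binary question (does a participant have at least one response under the cutoff?) behaves the same way. A threshold changes where the fluency penalty appears; it does not remove it.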
Top Two scores, in contrast, have modest and occasionally negative correlations between creativity and fluency. Moreover, there is little work that compares several scoring methods against each other; Mark Runco (2008) has done most of it, and this is the kind of research that needs to be done. We should point out that researchers can obtain our raw data, including the thousands of raw responses and rater scores. These data are ideal for comparing scoring systems. Researchers can rescore the responses using the Torrance approach, threshold approaches, and the percent-original system described by Runco. With these scores, we can compare the evidence for the validity of the scoring systems. Which system yields stronger relationships?

And although we sound repetitive, we again point out that the correlation between fluency and originality is extremely high, around r = .90. If large-sample studies—such as Torrance's (2008) normative data and Wallach and Kogan's (1965) data (Silvia, 2008)—show such correlations, then the lower correlations found in small-sample studies probably underestimate the true overlap. No statistical control can be done here: There is too little unique variance. We thus agree and disagree with Runco's (2008) view that "[a]s a matter of fact, the unique variance of originality and flexibility scores from divergent thinking tests are reliable, even if fluency scores are statistically controlled." We agree, because our studies showed strong partial effects of creative quality (raters' scores); and we disagree, because few past studies have shown medium or large partial effect sizes. This is a matter of interpretation for now, but we would like to see a meta-analysis.

Raters and Their Discontents

The issue of raters and the validity of their scores received a lot of attention in the comments. The objectivity of scoring methods is important to many researchers (e.g., Runco's [2008] comment). We should point out that raters sneak through the back door of many ostensibly objective studies. Whenever researchers screen, drop, pool, compare, collate, or evaluate responses, they are engaging in subjective rating. They may be doing so with only one rater, however, so variance due to raters cannot be modeled. For example, it is common to drop bizarre responses—but bizarre according to whom? Similarly, is the response "make a brick path" the same as "make a brick sidewalk"? This decision is easy, but it isn't objective in the sense favored by proponents of objective scoring. As we pointed out in our article, validity is what is important about scores, not objectivity, apparent or otherwise.

Beyond this point, we agree with the comments that raised questions about how raters are selected, trained, and evaluated. Without a doubt, these issues are complex. Raters will not always agree, and it seems likely that the disagreement will be based on systematic rater-level variables (e.g., expertise, intelligence, creativity, and discernment). What traits or experiences, if any, are necessary to rate effectively? What is the meaning of expertise in rating such responses? Based on the success of computer-based scoring in writing assessment, should we develop "expert system" programs to score the responses, thereby avoiding human raters altogether?
In his comment, John Baer (2008) offered a strong defense and a good analysis of the consensual assessment technique, the best-known method involving subjective ratings. We agree with his analysis, apart from one central quibble: Our research did not use the consensual assessment technique. Our target article mentions the consensual assessment technique in one paragraph—as an example of a method that uses subjective ratings—and our assessment method varies from it in key ways (e.g., using novice raters and choosing the best two scores). Our approach is thus not a use of consensual assessment gone horribly awry, but rather a new method that resembles consensual assessment in that it uses subjective ratings. We suspect that Baer is barking up the wrong but more thoroughly researched tree.

Nevertheless, the role of expertise in task scoring is unknown and clearly complex. Unlike tasks used in the consensual assessment technique, divergent-thinking tasks are not a domain of creative accomplishment—they are simple tasks intended to assess traits associated with individual differences in creativity. What, then, would a highly trained expert in divergent-thinking tasks look like? And if our novice raters were inappropriate choices of raters, then would the large effects be even bigger if we found "better" raters? Our studies worked well and found large effects, points that should not be overlooked when considering the soundness of the methods.

Soonmook Lee (2008), too, raised some insightful points about sampling raters and assessing their agreement. He pointed out that raters, like participants, are sampled: They can be more or less deviant, and the chance of a deviant rater increases as the number of raters increases. In this case, a couple of trained raters are more desirable than many independent raters; Kim (2008), too, raised this point in the context of the Torrance Tests. Baer (2008), in his comment, would disagree: Independence of raters is central to the validity of consensual assessment, he argued. We agree that rater training is worth exploring. In our studies, we did not extensively train the raters: We wanted their scores to be largely independent, along the lines of the consensual assessment technique. As a result, our findings are probably close to the lower end of possible rater agreement. Extensive training aimed at enhancing agreement should be feasible—it's a good idea for a research project.

In short, the validity of subjective ratings must be examined closely and considered seriously. Only the accumulation of research will provide evidence for or against the validity of raters; more likely, it will highlight the training or traits needed for consistent, valid ratings and thus provide guidelines for researchers. (Some of this research could involve rescoring the responses by using different raters and examining how well the groups of raters agreed.) We are intrigued, though, by the possibility of creating expert systems for scoring the responses. Creativity researchers are probably appalled by the notion that software can judge human creativity, but if nothing else, it is objective.

A Few Statistical Nuts and Bolts

Soonmook Lee (2008) offered some incisive thoughts about psychometric and statistical issues raised by our subjective scoring method. The first set of issues concerns the generalizability analyses. Our original draft had more information about the G and D studies, including a discussion of whether tasks ought to be treated as fixed versus random; most of this work was placed in a Web appendix.
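For readers who have not worked with this framework, here is a compressed sketch of the logic in generic notation, using a one-facet persons-by-raters design that is simpler than the designs we actually analyzed; the symbols are our illustrative notation, not equations reproduced from the target article or its appendix.

```latex
% One-facet generalizability design: persons (p) crossed with raters (r)
\[ X_{pr} = \mu + \nu_p + \nu_r + \nu_{pr,e} \]
% Variance components estimated in the G study
\[ \sigma^2_X = \sigma^2_p + \sigma^2_r + \sigma^2_{pr,e} \]
% Relative generalizability coefficient for the mean rating over n_r raters
\[ E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pr,e}/n_r} \]
```

A G study estimates the variance components; a D study then recomputes the coefficient for different numbers of raters (or tasks), which is why decisions about how many raters to use, and how to train them, matter so much for the dependability of subjective scores.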
We agree that tasks are typically viewed as random and that the models for subjective scores and uniqueness scores differ in key ways. In the end, Study 1 struck us as too small to conclude that the tasks are fixed or random: Our tiny study could not sustain firm conclusions. We had only three tasks, one per type, so the tasks and types are confounded. We agree with Lee, but we think that a richer data set is necessary to settle this issue.

Along the lines of G coefficients, Lee pointed out that the uniqueness scores would have fared better if they had had a rater facet instead of a single rater. This may be true, but the premise of uniqueness scoring, as emphasized in Runco's (2008) comment, is that it is objective and thus free of rater variance. The question of whether uniqueness scoring would improve with more raters is an empirical one. We expect some differences between raters in their uniqueness decisions, but the magnitude of the variance due to raters is unknown. Lee's claim that uniqueness scoring would have greater reliability than the other methods can be neither supported nor refuted with the data at hand.

We are skeptical of Lee's (2008) MTMM model of Study 1's data, in which he concluded that a single scoring factor describes the data. The two subjective scoring methods are based on the same data: The Top Two scores are a subset of the Average scores, so they are highly correlated. The study has only around 75 cases, too, so it didn't surprise us that Lee's model failed to converge to an appropriate solution. For what it's worth, there are many well-known problems with estimating MTMM models with structural equation modeling (SEM; Wothke, 1996). The findings from Study 2 show clear differences between the subjective methods, and they provide evidence for their validity.

Regarding our second study, we share some of Lee's (2008) concerns about the higher-order Huge Two model. Although not raised by Lee, the model fit of the Big Five and Huge Two models is not great. Across two large samples, we have found neither the fine model fit nor the null Plasticity–Stability correlation found by other researchers (e.g., DeYoung, 2006)—perhaps it is a Southern thing. Another issue, though, is that the lower-order Stability factors relate differently to divergent thinking: Agreeableness has a positive effect, but Conscientiousness has a negative effect. The positive effect of Stability conflates these conflicting lower-order effects. Lee noted that the effect of Stability on divergent thinking was not significant. This effect is significant with some methods of estimating standard errors but not with others. Because we lacked a strong reason to use, for example, maximum likelihood with first-order derivatives instead of common maximum likelihood, we used common maximum likelihood for all of our models. This case is a good example of the value of effect sizes, which are stable across estimation methods in a way that p values are not.

Baer (2008) proposed that subjective scoring methods suffer from a lack of standardization: Researchers cannot obtain scores that could be compared across studies because of differences in samples and raters. Basic research typically does not aspire to precise point estimates of a person's trait level. Nevertheless, the fields of test equating and item-response theory—both mature areas in psychometric theory—would disagree with Baer.
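To make the disagreement concrete, here is the kind of model those fields would point to: a many-facet Rasch-style sketch in our own illustrative notation (theta for person ability, b for task difficulty, c for rater severity), not a model we estimated in the target article.

```latex
% Probability that rater r judges person p's response to task i as creative,
% with all three facets placed on a single logit scale
\[ P(X_{pir} = 1 \mid \theta_p, b_i, c_r)
   = \frac{\exp(\theta_p - b_i - c_r)}{1 + \exp(\theta_p - b_i - c_r)} \]
```

Once tasks and raters are calibrated onto a common logit scale, estimates of theta are comparable across people who completed different tasks and were judged by different raters.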
If different samples and raters complicate comparability, then what should we think about adaptive tests (e.g., the Graduate Record Examination) in which people receive the same score despite completing different items?

More generally, Mumford, Vessey, and Barrett (2008) pointed out limitations in the scope of evidence for validity, based on Messick's (1995) model of validity; Baer (2008), too, questioned the value of the evidence for validity that we reported. We agree, of course; only so much can be accomplished in one study or in one article, particularly when the study involves a large sample of people completing time-consuming tasks. The extensions and elaborations that they suggest strike us as good ideas. Perhaps this is another example of Cronbach's (1957) two disciplines of scientific psychology—our psychometric approach and Mumford et al.'s (2008) experimental cognitive approach—talking past each other.

We think Mumford et al. (2008) and Baer (2008) are perhaps too dismissive. Within the model of validity advocated by Messick (1995), validity is not a binary feature of a study, an idea, or an assessment tool. Researchers gain evidence in support of validity over time. Certainly, researchers can quibble with how much evidence there is so far, but validity is not a simple thing you can get from a single study. To date, we have strong evidence for reliability (one kind of evidence for validity) and evidence for associations with domains of personality and college majors, roughly classified. This evidence is part of the field's broad, half-century interest in personality and creativity, so it is connected to a meaningful stream of theory and research. With regard to validity, however, our studies are an early word, not the last word.

Domain-General Traits and Potentials

Perhaps the thorniest problem in divergent-thinking research is whether divergent-thinking tasks measure a domain-general trait, be it creative cognition, creative potential, or simply creativity. The tradition associated with Guilford (1967), Wallach and Kogan (1965), and Torrance (2008) implies a general trait of creativity. Modern research on divergent thinking agrees (e.g., Plucker, 2005), although the field of creativity is split over the issue of domain-general traits, as Baer (2008) noted in his comment (see also Kaufman & Baer, 2005).

In his comment, Nathan Kogan (2008) provided a historical perspective on the assessment of creativity. The Wallach and Kogan (1965) tasks remain hugely popular, but it appears that their theoretical backdrop has been lost. Kogan pointed out that the Wallach and Kogan approach was founded on an associative model of creativity; this model lent meaning to the scores. With regard to our approach, Kogan (2008) argued that without a theory of creative ideation, "the basis for individual differences in [divergent-thinking] responses in the Silvia et al. study remains obscure." When you're right, you're right. Our research does not delve into the causes of the differences between people, and this limits its contribution to a general model of creativity. (Here, we suspect that Mumford et al. [2008] would agree and that Runco [2008] and Kim [2008] would disagree.)

Kogan's comment highlights what has changed over four decades of research. Modern psychometric research, in our opinion, is less concerned with the why and how of divergent-thinking tasks. The tasks have been elevated to measures of a general trait related to creative ability or potential.
Wallach and Kogan's concern with process has been obscured by Torrance's (2008) concern with predictive validity and Guilford's (1967) concern with structure. During the time that Wallach and Kogan (1965) were developing their research, the notion of domain-general creative abilities seemed sensible.

In our research, we spoke of divergent-thinking tasks as measuring simply creativity. This draws the ire of many creativity researchers, but we may as well be candid. Nearly all research with divergent-thinking tasks presumes that these tasks measure a global trait of creativity. (This presumption, we think, is probably wrong—more on this later.) There is value in being straightforward, but most psychologists prefer to call spades "digging process actualizers" and "excavation implements." If people believe that they are measuring global creative ability, then they should call their construct creativity or creative ability.

The issue of reifying divergent-thinking tasks won't be solved by calling the tasks measures of creative potential, a phrase advocated by Runco (2008). If someone has a trait-like "potential"—a stable thing that varies across people, remains stable over time, and influences observable outcomes—then the potential appears to be a trait. Why not cut to the chase and call it creativity or creative ability? What is creative potential if not the tendency to behave creatively, such as by being creative in everyday life, pursuing creative goals, and having creative accomplishments? As an analogy, consider fluid intelligence. We could call it the ability to solve novel problems, or we could call it the potential to solve novel problems. Our predictions and assessment are the same, so we do not gain much by calling it potential.

Some New Signs of Madness

To add to the critical cacophony, we add our own criticisms, thereby rending whatever coherence this reply may